##Contents of this folder
readme.txt:
This file. A description of my algorithm.

QuakePredictor_documented.java:
Source code of the final submission, with additional comments.
Please read this if you want to know the detailed behavior of my algorithm.

model_c.txt:
Result of offline training; a fragment of Java source code.
This code is copied into the Predictor class of the final submission source.

DataCreator.java, combine.py, training.py, parse_xgboost.py:
Used in offline training. Detailed descriptions are given in a later section.


##Concept of the algorithm
#Overview
The base idea is that low-frequency (0.01-1 Hz) pulses appear days to weeks before a quake
(mentioned in the TopCoder Blog (https://www.topcoder.com/blog/make-a-difference-win-prizes-in-nasas-quest-for-quakes/)),
so it is important to classify wave features into those from the weeks before a quake versus all others.

My algorithm extracts the frequency characteristics of the magnetometer signals for each site and time.
The EMA (exponential moving average) value of the hour is also used as a feature.
The probability of a quake at each moment is predicted by a decision tree array created by offline training.
The result for the hour is derived by a moving-average-like calculation.
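The "moving average like calculation" can be sketched as follows. The smoothing factor ALPHA and the choice of an exponential moving average (rather than a windowed average) are assumptions of this sketch; the source does not specify them.

```python
# Sketch: smooth per-window quake probabilities into an hourly result.
# ALPHA and the EMA form are assumptions; the source only says the result
# is derived by a "moving average like calculation".

ALPHA = 0.1  # hypothetical smoothing factor

def smooth_probabilities(probs):
    """Return EMA-smoothed values for a sequence of per-window probabilities."""
    smoothed = []
    ema = probs[0]
    for p in probs:
        ema = ALPHA * p + (1.0 - ALPHA) * ema
        smoothed.append(ema)
    return smoothed

print(smooth_probabilities([0.0, 1.0, 1.0, 1.0])[-1])
```

Smoothing like this damps single-window spikes, so a sustained run of high per-window probabilities is needed before the hourly result rises.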

#Features
An FFT is applied to the magnetometer signal of each axis once every 2^13 samples (corresponding to 3-5 minutes).
The length of each transform window is 2^14 samples (5-10 minutes), so consecutive windows overlap by half.
(e.g. the 1st window covers 0:00-0:10, the 2nd covers 0:05-0:15, …)

Amplitudes are summed over these 14 frequency ranges (Hz):
0, 0-0.004, 0.004-0.005, 0.005-0.006, 0.006-0.008, 0.008-0.01, 0.01-0.02, 0.02-0.05, 0.05-0.1, 0.1-0.2, 0.2-0.5, 0.5-1, 1-2, 2-5
These 14 values are then used as the frequency characteristics of the axis.

feature0 -feature13: Frequency characteristics(0, 0-0.004, …) of axis0
feature14-feature27: Frequency characteristics of axis1
feature28-feature41: Frequency characteristics of axis2
feature42-feature55: feature0 /feature14, feature1 /feature15, …
feature56-feature69: feature14/feature28, feature15/feature29, …
feature70-feature83: feature28/feature0 , feature29/feature1,  …
feature84          : EMA
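The feature layout above can be sketched in numpy as follows. The sample rate is an assumption (2^13 samples spanning 3-5 minutes implies roughly 27-34 Hz), and the epsilon guard in the ratio features is an assumption of this sketch; the band edges and feature indices come from the list above.

```python
import numpy as np

# Sketch of the 85-dimensional feature vector described above.
# SAMPLE_RATE is an assumption; the band edges are from the README.

SAMPLE_RATE = 32.0          # assumed samples per second
WINDOW = 2 ** 14            # FFT window length (5-10 minutes)
BAND_EDGES = [0.0, 0.004, 0.005, 0.006, 0.008, 0.01,
              0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]

def axis_characteristics(signal):
    """14 summed FFT amplitudes for one axis: the DC bin, then 13 bands."""
    spectrum = np.abs(np.fft.rfft(signal[:WINDOW]))
    freqs = np.fft.rfftfreq(WINDOW, d=1.0 / SAMPLE_RATE)
    feats = [spectrum[0]]                       # range "0": the DC component
    for lo, hi in zip(BAND_EDGES[:-1], BAND_EDGES[1:]):
        mask = (freqs > lo) & (freqs <= hi)
        feats.append(spectrum[mask].sum())
    return np.array(feats)

def feature_vector(axis0, axis1, axis2, ema):
    """Assemble features 0-84: per-axis bands, cross-axis ratios, EMA."""
    eps = 1e-12  # guard against division by zero (an assumption)
    f0, f1, f2 = (axis_characteristics(a) for a in (axis0, axis1, axis2))
    return np.concatenate([f0, f1, f2,
                           f0 / (f1 + eps),    # features 42-55
                           f1 / (f2 + eps),    # features 56-69
                           f2 / (f0 + eps),    # features 70-83
                           [ema]])             # feature 84

vec = feature_vector(np.random.randn(WINDOW), np.random.randn(WINDOW),
                     np.random.randn(WINDOW), ema=0.5)
print(vec.shape)  # (85,)
```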


##Offline training
#Processes
1. Extract and label the features (DataCreator.java):
As in testing, I extracted features for all seeds and all sites.
To reduce the training time, only the hours from the 6 weeks before the end of the data set were used.

Features were labeled by whether a quake occurs within 3 weeks (labeled 1) or no quake occurs (labeled 0).
Rows where a quake occurs between 3 and 6 weeks ahead were not used.
(In this step labels may have other values; in process 3, those values are mapped to these labels.)
So the predictor created by this training predicts whether a quake occurs within 3 weeks or no quake occurs.

This process outputs a csv file for each seed.
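The labeling rule above can be sketched as follows. The function name, the way time-to-quake is supplied (hours, with None meaning no quake in the data set), and the treatment of quakes beyond 6 weeks as label 0 are assumptions of this sketch.

```python
WEEK_HOURS = 7 * 24  # hours per week

def label_row(hours_to_quake):
    """Label a feature row by time until the next quake.

    Returns 1 within 3 weeks, 0 for no quake, and None for the discarded
    3-6 week band. Encoding "no quake" as hours_to_quake=None and mapping
    quakes beyond 6 weeks to label 0 are assumptions of this sketch.
    """
    if hours_to_quake is None or hours_to_quake >= 6 * WEEK_HOURS:
        return 0
    if hours_to_quake <= 3 * WEEK_HOURS:
        return 1
    return None  # between 3 and 6 weeks: row is dropped

print(label_row(100), label_row(5 * WEEK_HOURS), label_row(None))
```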

2. Merge and shrink the csv files (combine.py):
The csv files from process 1 are merged into one file.
To reduce the training time, rows are dropped randomly.
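The merge-and-shrink step can be sketched with pandas (a listed dependency). The keep fraction and the function shape are assumptions; the source does not state how many rows are dropped.

```python
import pandas as pd

KEEP_FRACTION = 0.25  # hypothetical; the actual fraction is not stated

def combine(csv_paths, keep_fraction=KEEP_FRACTION, seed=0):
    """Merge per-seed csv files into one frame and randomly drop rows."""
    merged = pd.concat((pd.read_csv(p) for p in csv_paths),
                       ignore_index=True)
    # sample(frac=...) keeps a random subset, shrinking training time
    return merged.sample(frac=keep_fraction, random_state=seed)
```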

3. Create decision tree array (training.py):
Create Gradient Boosted Decision Trees using xgboost.
(Parameters were determined by my intuition and some trial and error.)
8 slices of the input * 3 trees = 24 trees are created (24 trees keep the code within the size limit of the final submission).
The models are dumped to a text file.

4. Parse the xgboost model text into Java (parse_xgboost.py):
Parses the model text into model_c.txt.
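A minimal sketch of this parsing step: xgboost text dumps use split lines such as `0:[f84<0.5] yes=1,no=2,missing=1` and leaf lines such as `1:leaf=0.1`. The nested-ternary Java output below is illustrative only, not the actual layout of model_c.txt.

```python
import re

# Sketch: turn one tree of an xgboost text dump into a nested Java ternary.
# The dump line formats are real xgboost conventions; the Java output shape
# is an illustration, not the actual model_c.txt layout.

SPLIT_RE = re.compile(r"(\d+):\[f(\d+)<([-\d.eE+]+)\] yes=(\d+),no=(\d+)")
LEAF_RE = re.compile(r"(\d+):leaf=([-\d.eE+]+)")

def parse_tree(dump_lines):
    """Map node id -> ('split', feature, threshold, yes, no) or ('leaf', value)."""
    nodes = {}
    for line in dump_lines:
        m = SPLIT_RE.search(line)
        if m:
            nid, feat, thr, yes, no = m.groups()
            nodes[int(nid)] = ("split", int(feat), float(thr), int(yes), int(no))
            continue
        m = LEAF_RE.search(line)
        if m:
            nodes[int(m.group(1))] = ("leaf", float(m.group(2)))
    return nodes

def to_java(nodes, nid=0):
    """Emit a nested Java conditional expression for the tree."""
    node = nodes[nid]
    if node[0] == "leaf":
        return repr(node[1])
    _, feat, thr, yes, no = node
    return ("(f[%d] < %s ? %s : %s)"
            % (feat, thr, to_java(nodes, yes), to_java(nodes, no)))

dump = ["0:[f84<0.5] yes=1,no=2,missing=1",
        "\t1:leaf=0.1",
        "\t2:leaf=-0.2"]
print(to_java(parse_tree(dump)))
```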

#Dependencies
JDK 1.8.0
python 3.4.3
numpy 1.9.2
scikit-learn 0.15.2
pandas 0.16.1
xgboost 0.4

#Steps for offline training (to recreate model_c.txt)
1. Install all dependencies.
2. Change the current directory to this folder.
3. Copy gtf.csv to the current directory.
4. Create a data/ directory.
5. Copy the training data to the data/ directory. (e.g. data/4/)
6. Run the commands listed below. model_c.txt will then be created.
javac DataCreator.java
java DataCreator
python combine.py
python training.py
python parse_xgboost.py


##What is the highest score you could have reached in this contest without using the magnetometer signal?
Though I could not find out how to use sitesData and otherQuakes,
I think otherQuakes and gtf might enable us to create a database of quakes,
and we might be able to search for quakes in that database by using sitesData and otherQuakes.